
[Feature] implement async LoRA prefetch #14190

Closed
glenliu21 wants to merge 5 commits into sgl-project:main from glenliu21:lora_prefetch

Conversation

@glenliu21
Contributor

Motivation

This PR addresses #8712. I used the prefetch policy described in S-LoRA, where LoRA adapters are prefetched based on the requests in the Scheduler's waiting queue.
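The selection policy can be sketched roughly as follows. This is a minimal illustration, not SGLang's actual code: the function name, the request representation, and the `resident_adapters` set are all hypothetical stand-ins for the Scheduler's internal state.

```python
def select_adapters_to_prefetch(waiting_queue, resident_adapters, max_loras_prefetch):
    """Scan the waiting queue in arrival order and pick up to
    max_loras_prefetch distinct LoRA adapters not yet resident on the GPU.

    waiting_queue: list of dicts with an optional "lora_name" key (hypothetical).
    resident_adapters: set of adapter names already in the GPU memory pool.
    """
    picked = []
    for req in waiting_queue:
        name = req.get("lora_name")
        # Skip base-model requests, already-resident adapters, and duplicates.
        if name is None or name in resident_adapters or name in picked:
            continue
        picked.append(name)
        if len(picked) == max_loras_prefetch:
            break
    return picked
```

The key property is that prefetch order follows queue order, so the adapters loaded first are the ones the Scheduler will need soonest.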

Modifications

  • Added max_loras_prefetch as a server argument
  • Implemented creation of a ForwardBatch as a LoRA prefetch batch, consisting of the requests next to be run from the Scheduler's waiting queue
  • Implemented the LoRA prefetch backend in LoRAManager, the memory pool, and the LoRA backend
  • Utilized ThreadPoolExecutor and a separate torch.cuda.Stream to enable async prefetching
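The async mechanics above can be sketched as below. This is an illustrative reduction, not the PR's implementation: the class and method names are invented, and the `load_fn` callback stands in for the host-to-GPU weight copy that the real code issues on a dedicated `torch.cuda.Stream` so it overlaps with the compute stream.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncLoRAPrefetcher:
    """Hypothetical sketch: overlap adapter loads with the scheduler loop.

    load_fn(name) performs the expensive load (in the real PR, a weight
    copy submitted on a separate CUDA stream); here it is any callable.
    """

    def __init__(self, load_fn, max_workers=1):
        self._load_fn = load_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = {}  # adapter name -> Future

    def prefetch(self, names):
        # Fire-and-forget: submit each load once, without blocking the caller.
        for name in names:
            if name not in self._pending:
                self._pending[name] = self._pool.submit(self._load_fn, name)

    def wait(self, name):
        # Block only if the adapter is needed before its load has finished.
        fut = self._pending.pop(name, None)
        return fut.result() if fut is not None else None
```

The win is that `prefetch` returns immediately, so adapter loads for queued requests proceed in the background while the current batch runs; `wait` synchronizes only at the point an adapter is actually used.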

Accuracy Tests

  • Added a basic end-to-end test to verify that enabling LoRA prefetching does not change expected outputs

Benchmarking and Profiling

@ConnorLi96 ran the following commands to benchmark LoRA prefetching:

  1. for i in {1..16}; do curl -s -X POST http://0.0.0.0:30001/load_lora_adapter -H 'Content-Type: application/json' -d "{\"lora_name\": \"adapter${i}\", \"lora_path\": \"/workspace/adapters/llama_3_1_8B_adapter\"}"; echo " ✓ adapter${i}"; done
  2. python3 -m sglang.bench_serving --backend sglang --base-url http://localhost:30001/ --dataset-name random --num-prompts 100 --request-rate 4 --random-input-len 2048 --random-output-len 1024 --disable-ignore-eos --disable-tqdm --lora-name adapter1 adapter2 adapter3 adapter4 adapter5 adapter6 adapter7 adapter8 adapter9 adapter10 adapter11 adapter12 adapter13 adapter14 adapter15 adapter16

This yielded the following results:

Before

----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22579.58  
Median E2E Latency (ms):                 22261.88  
---------------Time to First Token----------------
Mean TTFT (ms):                          16157.50  
Median TTFT (ms):                        15918.48  
P99 TTFT (ms):                           34927.59  

After

----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   17620.85  
Median E2E Latency (ms):                 16273.82  
---------------Time to First Token----------------
Mean TTFT (ms):                          11926.84  
Median TTFT (ms):                        10865.44  
P99 TTFT (ms):                           26765.88  

These results show about a 31% decrease in median TTFT and a 27% decrease in median E2E latency.

Checklist

